A General Metric for Riemannian Manifold Hamiltonian Monte Carlo 
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Markov Chain Monte Carlo (MCMC) is an invaluable means of inference with complicated mod- 
els, and Hamiltonian Monte Carlo, in particular Riemannian Manifold Hamiltonian Monte Carlo 
(RMHMC), has demonstrated impressive success in many challenging problems. Current RMHMC 
implementations, however, rely on a Riemannian metric that limits their application to analytically- 
convenient models. In this paper I propose a new metric for RMHMC without these limitations and 
verify its success on a distribution that emulates many hierarchical and latent models. 



Riemannian Manifold Hamiltonian Monte Carlo pro- 
vides a powerful tool for the efficient sampling from com- 
plex distributions, but the applicability of existing ap- 
proaches has been limited by the dependency on the 
Fisher-Rao metric. In this paper I introduce a new met- 
ric that admits a general implementation of Riemannian 
Manifold Hamiltonian Monte Carlo and demonstrate its 
efficacy on a distribution that mirrors the pathological 
behavior of common models. 



I. HAMILTONIAN MONTE CARLO 

Hamiltonian Monte Carlo (HMC) takes advantage 
of symplectic geometry to yield efficient Markov tran- 
sitions [l[. Augmenting the parameters of an N- 
dimensional target density, 7r(q), with corresponding mo- 
menta, p, defines a joint density, 

7r(p,q) = 7r(p|q)7r(q) 

= cxp [log7r(p|q)] exp [log7r(q)] 
cxexp [-T(p,q)] exp [-V(q)} 
= exp [-H(p, q)] . 



The Hamiltonian, H(p, q) 
trajectories between points z 
equations 

dq 

~dt ~ ^ 
dp 

~dt ~ 



= T(p,q) + V(q), defines 
= {p, q} via the differential 

OH 
dp 
dH 
<9q 



Because these trajectories preserve the value of the 
Hamiltonian and the differential volume d 2N z, they also 
define Markovian transitions with the stationary den- 
sity 7r(p, q). Alternating this Hamiltonian evolution with 
conditional samples of the momenta, 

P ~ 7r(p|q) cx exp [-T(p, q)] , 

yields an ergodic Markov chain sampling from z and, 
because the marginal of 7r(p, q) is constructed to be the 



target distribution, the desired samples from 7r(q) follow 
by simply disregarding the momenta. 

No matter the choice of the kinetic energy, T(p, q), the 
evolution equations incorporate the gradient of the po- 
tential, V(q), and hence higher order information about 
the target distribution. This gradient guides the Markov 
chain along regions of high probability mass and reduces 
random walk behavior. Note that, in practice, the Hamil- 
tonian evolution cannot be performed analytically and we 
must resort to numerical integration. Error in the inte- 
gration scheme introduces bias into the transitions, but 
this is readily avoided by considering the evolution not as 
a transition but rather as the proposal for a Metropolis 
transition HB- 

The first [2| and still most common choice of the con- 
ditional density, 7r(p|q), is a standard gaussian, 

7r(p|q)=jV(p|0,M) J 



or 



T(p,q) = ^p T -M- 1 - P , 



(1) 
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where the mass matrix M allows for a global decorre- 
lation and rescaling of the parameters with respect to 
each other. This choice, however, ultimately limits the 
effectiveness of HMC when applied to intricate target dis- 
tributions. Because p T • M _1 • p is a \ 2 variate, in equi- 
librium AT m N/2 and, with the Hamiltonian conserved 
along each trajectory, this implies that the variation in 
the potential is also limited to AV « N/2. When the 
target distribution is highly correlated, the typical set 
spans a potential gap much larger than this: the result- 
ing samples become highly correlated no matter how long 
the trajectories are evolved [|| and the Markov chain de- 
volves towards a random walk. 

Another issue with the simple choice (QJ is that the in- 
evitable numerical integration introduces a spatial scale 
into the system via a finite step-size. Complicated target 
distributions will typically exhibit multiple spatial scales 
depending on the particular value of the parameters, and 
any single choice of a step-size will generate at least some 
inefficiency. If the step-size is chosen to maximize effi- 
ciency, as common in adaptive schemes, regions of the 
target distribution with large curvature, and hence small 
spatial scales, can be missed entirely by the numerical 
trajectories. 
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These weaknesses can be overcome by appealing to a 
more sophisticated choice of the conditional density: a 
gaussian conditionally dependent on the q through a co- 
variance matrix, 

7r(p|q)=JV(p|0 J S(q)) ) 



or 



T(p,q) 



s- x (q) 



log|S(q) 



Because the resulting Hamiltonian trajectories arc re- 
lated to geodesies on a Riemannian manifold with met- 
ric S(q), this choice is known as Riemannian Manifold 
Hamiltonian Monte Carlo (RMHMC) Q. Similarly, the 
constant metric of ([1]) can be thought of as emulating 
dynamics on a Euclidean manifold, and to be consistent 
I will refer to use of the simpler Hamiltonian as Euclidean 
Manifold Hamiltonian Monte Carlo (EMHMC). 

The freedom in specifying a metric admits two signif- 
icant improvements: a proper choice of S(q) can dy- 
namically decorrelate and rescale the target distribution 
to avoid inefficiencies in the numerical integration, while 
also yielding a dynamic determinant whose variations can 
compensate for much larger variations in the potential. 

What, however, exactly defines a proper choice for the 
metric? When the target distribution is a multivariate 
gaussian, 

V(q) = V • S- 1 • q, 



One way to avoid indefinite metrics is to take advan- 
tage of any conditioning variables, y, in the target distri- 
bution. Marginalizing the Hessian over these condition- 
ing variables yields the Fisher- Rao metric Q , 



d 2 V(<i\y) 
dq l dqj 



which is guaranteed to be positive-semidefinitc. For all 
but the simplest conditional distributions, however, the 
marginalization is unfeasible and, even when it can be 
performed analytically, the resulting metric can still be 
singular. Moreover, the marginalization removes the cor- 
relation between variables in many hierarchical and la- 
tent models, almost eliminating the effectiveness of the 
metric. Of course, all of this is immaterial if the target 
distribution lacks natural conditioning variables. 

We need a means of constructing a metric from the 
Hessian that is not only everywhere well-behaved but also 
practical to compute for any given target distribution. 



II. THE SOFTABS METRIC 

With a careful application of matrix functions, it is 
possible to maintain the desirable behavior of the Hes- 
sian in convex neighborhoods while avoiding its singular 
behavior elsewhere. Moreover, because the functions are 
local the resulting metric is readily implemented for gen- 
eral distributions. 



the target distribution is standarized by taking S(q) = 
S _1 Q. In a convex neighborhood any target distribu- 
tion can be approximated by a multivariate gaussian, 



or, equivalently, 



^^(qlCH- 1 ) 



V(q) « -q T H 



with the Hessian matrix 

Hij = d 2 V/dq l dq\ 

which immediately motivates the candidate metric 
S(q) = H. _ 

This metric quickly runs into problems, however, when 
the target distribution is not globally convex. In neigh- 
borhoods where the Hessian is not positive-definite, for 
example, the conditional density 7r(p|q) becomes im- 
proper. Moreover, in the neighborhoods where the sig- 
nature of the Hessian changes, the log determinant di- 
verges and the Hamiltonian evolution becomes singular. 
These neighborhoods effectively partition the support of 
the target distribution into a disjoint union of compact 
neighborhoods between which the Markov chain cannot 
transition. 



A. Definition 

The exponential map , exp, is a matrix function from 
the space of all matrices to the component of the general 
linear group, GL(n), connected to the identity matrix: an 
isomorphism of the space of positive-definite matrices. 
Because this mapping preserves the symmetric part of 
the domain, any symmetric matrix, such as the Hessian, 
is guaranteed to be mapped to a symmetric, positive- 
definite matrix admissible as a Riemannian metric. 

One benefit of the exponential map is that it preserves 
the eigenbasis of the input matrix, X. If 

X = Q A Q T 

is the eigendecomposition of X with A = Diag(Ai) the 
diagonal matrix of eigenvalues and Q the corresponding 
matrix of eigenvectors, then the exponential map yields 

expX = Q • exp A • Q T . 

The metric cxpH provides the same dccorrclation as the 
Hessian but, unfortunately, also severely warps the eigen- 
values and the corresponding rescaling of the local pa- 
rameters. 

By combining multiple exponential mappings, how- 
ever, we can largely preserve the spectral decomposition 
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FIG. 1. The SoftAbs map preserves the eigenbasis of the Hes- 
sian but transforms the eigenvalues, A, with a smooth approxi- 
mation to the absolute value. The inverse of the regularization 
parameter, a, controls the "hardness" of the approximation; 
as a — y oo the SoftAbs map reduces to the exact absolute 
value. 



of the Hessian. In particular, the SoftAbs map 
IX.1 = 

[exp(aX) + cxp(— aX)] • X • [cxp(aX) — cxp(— aX)] 1 

approximates the absolute value of the eigenspectrum 
with a smooth function: 



IX.1 = Q • 1X1 ■ Q 1 



where 



1X1 = Diag A, 



gtvAi I g — aXi 



gaAi g— aXi 

Diag (Aj cothaAi) . 



This map not only ensures that the transformed eigenval- 
ues are positive but also regularizes any small eigenvalues 
that might introduce numerical instabilities (Figure [lj . 

Applying the SoftAbs map to the Hessian guarantees 
a well-behaved metric for RMHMC, IHl, that preserves 
the desired properties of the Hessian while regularizing 
its numerical singularities. In a practical implementa- 
tion, a limits the scaling of the integration step-size and 
restrains the numerical integrator from unwise extrapo- 
lations, emulating a trust region common in nonlinear 
optimization Q. 

This construction also motivates a family of approx- 
imate metrics with possible utility in circumstances of 
limited computational resources (Appendix IA|) . 



B. Implementation 

In practice, exponential maps can be difficult to imple- 
ment Q; the eigendecomposition used above, for exam- 
ple, can suffer from numerical instabilities when applied 
to general matrices because of ambiguities among the 



eigenvectors. The Hessian, however, is symmetric and 
the eigenvectors are guaranteed to be orthogonal. Conse- 
quently, the eigendecomposition is well-behaved and pro- 
vides a practical means of computing the SoftAbs map. 

To implement the SoftAbs metric we first perform the 
eigendecomposition of the Hessian 

H = Q A Q T , 

and then reconstruct the metric as 

IHl = Q 1X1 Q T , 

with l\l = Diag (A, cothaAi). 

Hamiltonian evolution also requires two derivatives: 
the gradient of the quadratic form, p T • IHl 1 • p, and 
the log determinant, log \ IHI\. 

The latter can be computed as 0, 



d(p T -IHl 1 • p 



= P T ■ aim • P 

= -p T -mr 1 -dim-mr 1 p 



= (Q T ■ p 

where o denotes the Hadamard product and 



[J o Q T OH Q] (Q T ■ p) 



A,- coth a\i 



Xj coth a\j 



Xi — A; 



Note that when A; = Xj , such as for the diagonal elements 
or degenerate eigenvalues, this becomes the derivative, 

9 , , , 
Jij — > t— Xi coth aXj . 
oXi 

Unfortunately, this form of the gradient is computa- 
tionally inefficient, requiring 0(A^ 3 ) for each component 
of the gradient, and hence (9(iV 4 ) overall. Taking ad- 
vantage of the properties of the Hadamard product [ll[ , 
however, the gradient can be manipulated to give 



mr 



Tr [Q D J D Q T <9H] 



where D = Diag ((Q T • p) 



If the matrix Q • D • J ■ D • 



Q is first cached, then each component of the gradient 
can be computed in only 0(A^ 2 ) so that the complete 
gradient does not exceed the 0(N 3 ) complexity of the 
decomposition itself. 

Similar Hadamard identities reduce the gradient of the 
log determinant to 

d log \mi\ = Tr [Q (R o J) Q T • OH] , 

where 

R = Diag (\ — Tr — r 

\ A; cotn aXi 

Once again, caching the intermediate matrix, 
Q(RoJ)Q T , enables the full gradient to be com- 
puted in 0(N 3 ). 

These results admit an efficient symplec- 
tic integrator (Appendix [B| for RMHMC; a 
C'H — h implementation is available online at 



http : / /betanalpha . github . com/ j amon/ 
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Algorithm 


Warm-Up 
Iterations 


Samples 


e 


Accept 
Rate 


CPU Time 

(s) 


ESS 


ESS / Time 
(s 1 ) 


EMHMC 


10 3 


10 5 


0.001 


0.999 


1627 


70.3 


0.0432 


RMHMC 








0.946 


6282 


856 


0.136 



TABLE I. When comparing the effective sample size of the latent variable, v, in the funnel distribution, hand-tuned EMHMC 
is over three times less effective than adaptively-tuned RMHMC. CPU time was measured with the clock function in the C++ 
library time. 



III. EXPERIMENTS 

The utility of the SoftAbs metric is best demonstrated 
on complex distributions. Neal's funnel distribution [i~2| 

n 

tt(x, v) = Y[M{xi\0, e- v ) ■ Af{v\0, 9) , 
i=i 

emulates many pathological features of popular distri- 
butions, such as those arising in hierarchical fl3j and 
latent [l4[ models. Note that, by construction, the 
marginal distribution of v is simply v ~ Af(0, 90, inde- 
pendent of n, admitting v and its marginal distribution 
as a simple diagnostic of bias in any sampling procedure. 

In each experiment a Markov chain is randomly ini- 
tialized, qi ~ U(— 1, 1), and then taken through a 
series of warm-up iterations before sampling begins. 
Where noted, the integrator step-size, e, is adapted with 
dual-averaging to yield a target Metropolis acceptance 
rate [lj| . The number of integration steps is set by hand 
to approximate the half-period of the oscillating trajec- 
tories (Figures El [5| . 

Autocorrelations, pi, of v are computed with an initial 
monotone sequence estimator fl6| and the effective sam- 
ple size (ESS) is defined as ESS = I (l + 2 J2l=i Pi) > 
where / is the total number of generated samples. 

The above procedure is applied to EMHMC with step- 
size adaptation, EMHMC without step-size adaption, 
and RMHMC with the SoftAbs metric. 



A. EMHMC with Adaptation 

Despite its simplicity, the funnel demonstrates many 
of the limitations of EMHMC. When adaptively tuned 
to the nominal acceptance rate r = 0.65 [H, the inte- 
grator step-size exceeds the spatial scale of the narrow 
neck; even though the probability mass of the mouth and 
neck of the funnel is comparable, the resulting trajecto- 
ries overlook the neck entirely and bias resulting expec- 
tations without any obvious indication (Figure [2]). 



1 Note the use of the convention N((J,, & 2 )- 



B. EMHMC without Adaptation 

Because we know the truth in this case, we can aban- 
don adaptive tuning and instead tune the step-size by 
hand; a smaller step-size ensures that the trajectories ex- 
plore most of the funnel's probability mass and that the 
marginal distribution p(v) is correct within Monte Carlo 
error. Unfortunately, the funnel also exhibits the limita- 
tions of a position-independent kinetic energy. The vari- 
ation of the potential within the typical set is huge, and 
the meager variation of the kinetic energy dramatically 
restricts the distance of each transition (Figure [3]). The 
EMHMC transitions struggle to cross between the mouth 
and neck of the funnel, and the Markov chain becomes 
little more than a random walk across the distribution 
(Figure |U). 

C. RMHMC with the SoftAbs Metric 

On the other hand, the SoftAbs metric, here with 
a = 10 6 , allows RMHMC to explore the entire distri- 
bution within a single trajectory (Figure Because 
the metric accounts for local curvature, the step-size can 
be adaptively tunec0 without introducing any bias. The 
huge autocorrelations of EMHMC vanish (Figure [6]) and, 
despite the increased computation required for each tran- 
sition, RMHMC yields a more efficient generation of ef- 
fective samples (Table Q|. 

Nominally, the 0(A^ 3 ) computational burden of 
RMHMC is significantly worse than the O(N) burden 
of EMHMC. The pathological behavior of distributions 
like the funnel, however, scales much faster, often expo- 
nentially, and the benefit of RMHMC with the SoftAbs 
metric only increases with dimension. Moreover, this 
concern ignores the burden of computing the potential 
itself which, as in the case of Bayesian posteriors with 
many data, can overwhelm the 0(A^ 3 ) burden entirely. 



2 The increased information encoded in the metric should admit 
a larger acceptance rate, r, for RMHMC than the EMHMC case 
of r = 0.65. Motivated by some simple experiments, here the 
target rate for RMHMC is set to r = 0.95. 




FIG. 2. Adaptive tuning of EMHMC results in an excessively large step-size, preventing trajectories from exploring the neck 
of the funnel and biasing the stationary distribution. Consequently, the samples of v are inconsistent with the marginal 
distribution JV(0, 9). 
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EMHMC (No Stepsize Adaptation) 



(100+1) Dimensional Funnel e = 0.01, L = 5000 




t = e * n/L 



FIG. 3. EMHMC trajectories are limited to AV ~ (n+ l)/2 and consequently explore only a small neighborhood of the funnel. 
Note that the trajectories are oscillatory; half of the largest period of oscillation, here T/2 « 8, defines the optimal integration 
time for maximizing distance, and minimizing autocorrelation, between the samples. 



EMHMC (No Stepsize Adaptation) 




Sample Lag 



FIG. 4. Although not optimal, a smaller step-size tuned by hand allows the trajectories to penetrate the neck of the funnel, 
and the resulting samples of v are consistent with the true marginal, JV(0, 9). In real applications where the truth is not known 
a priori, this sort of hand-tuning is not viable. 
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RMHMC with SoftAbs Metric 



(100+1) Dimensional Funnel e = 0.01, L = 5000 




t = e * n/L 



FIG. 5. The log determinant in the Hamiltonian allows RMHMC trajectories to explore without the restriction to 
AV ~ (n + l)/2, and the dynamic decorrelation/scaling ensures that a single integrator step-size is efficient across the en- 
tire distribution. As in EMHMC, the trajectories oscillate and the half-period of the longest oscillation, T/2 w 25, defines the 
optimal integration time. 



RMHMC with SoftAbs Metric 
1.2 




-15 



Mean = 0.0318454 
Standard Deviation = 2.88865 
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FIG. 6. The far-reaching trajectories of RMHMC dramatically reduce the autocorrelation of the samples which, despite the 
step-size adaptation, are consistent with the desired marginal A/"(0, 9). Note that significantly fewer lags shown here than above. 
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IV. CONCLUSIONS AND FUTURE WORK 

By smoothly regularizing the eigendecomposition of 
the Hessian, the SoftAbs metric admits a general imple- 
mentation of RMHMC robust against the many patholo- 
gies to which EMHMC can be vulnerable. Despite its ap- 
parently steep computational burden, the SoftAbs metric 
allows for practical inference on complex models never 
before deemed feasible. 

Potential to reduce that burden might be found by 
looking deeper into Riemannian geometry. Use of frame 
bundles [17| to transport the Hessian eigenbasis along 
a trajectory, for example, could be more efficient than 
recomputing the eigendecomposition at each point. 

Moreover, further insight into the metric itself may be 
found in understanding the geometric consequences of 
the SoftAbs mapping. What effect, for example, does 
the regularization of the SoftAbs mapping have on the 
geometric curvature of the manifold? Perhaps this pa- 
rameterized regularization be related to Ricci flow flST ]. 
or other smoothing diffcomorphisms. 



APPENDIX A: APPROXIMATIONS TO THE 
SOFTABS METRIC 



A. Approximations 



The SoftAbs metric requires an eigendecomposition 
and all third-order derivatives of the potential, both of 
which can be sufficiently burdensome as to render the 
metric unfeasible for very large problems. Transforming 
various approximations to the Hessian with the SoftAbs 
map, however, yields a series of approximations that of- 
fer less intense alternatives that may be useful for certain 
distributions. 

1. Diagonal 



Ignoring the local decorrelation of the potential and 
instead focusing on the rescaling of the parameters, we 
can simply take 
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with 



mi « Diag (Ha coth aH u ) . 



The computational burden of the resulting evolution 
scales as 0(iV 2 ), much better than the 0(iV 3 ) of the full 
SoftAbs metric, and, because the Hessian of the funnel 
is almost diagonal, the approximation loses little infor- 
mation. Consequently, the diagonal approximation per- 
forms exceptionally well in this case ( Table ITI1 Figures [3 
[8]). Note that, on more correlated distributions the ap- 
proximation will be more limiting and the performance 
will suffer. 



Algorithm 


Warm-Up 
Iterations 


Samples 


e 


Accept Rate 


CPU Time 

(s) 


ESS 


ESS / Time 
(s 1 ) 


EMHMC 


10 s 


10 5 


0.001 


0.999 


1627 


70.3 


0.0432 


RMHMC (SoftAbs) 


10 3 

m3 


10 3 

1 n3 


rt At\ 


0.946 


6282 


856 


0.136 



TABLE II. The diagonal approximation to the SoftAbs metric dramatically out-performs the full SoftAbs metric when applied 
to the funnel distribution, because the Hessian of the funnel potential is almost diagonal. As before, the step-size of the diagonal 
SoftAbs chain was determined with dual-averaging but with a target acceptance rate of r = 0.8. 
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RMHMC with Diagonal SoftAbs Metric 



(100+1) Dimensional Funnel 



e = 0.01, L= 10000 



10 



> 



-5 - 



-10 




Large V 



Small V 



-15 -10 -5 5 10 15 




40 60 
t = e * n/L 



100 



FIG. 7. Although clearly more ragged than the trajectories with the full SoftAbs metric, the trajectories with the diagonal 
SoftAbs metric explore almost the same expanse as the exact metric. 




RMHMC with Diagonal SoftAbs Metric 
1.2 



1 



-15 



Mean = -0.164541 

Standard Deviation = 3.0051 1 



200 
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Sample 



800 



1000 
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0.4 
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FIG. 8. The samples drawn with the diagonal SoftAbs metric are consistent with the true marginal for v, albeit at the expense 
of slightly higher autocorrelations. 
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RMHMC with Diagonal Outer-SoftAbs Metric 



(100+1) Dimensional Funnel 



e = 0.001, L = 50000 



10 



> 



-5 - 



-10 




Large V 



Small V 



-15 -10 -5 5 10 15 



350 r 
300 - 
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100 
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t = e * n/L 



H 
T 

V 




40 



50 



FIG. 9. When a has to be kept small to ensure the stability of the numerical evolution, the benefit of the dynamic metric is 
substantially reduced. The oscillating trajectories become much more complicated and, as in the EMHMC case, the variation 
of the potential over each trajectory is limited. 



RMHMC with Diagonal Outer-SoftAbs Metric 



15 



10 



> 



-5 - 



-10 



-15 



Mean = 2.66789 

Standard Deviation = 1.69274 



200 



400 600 
Sample 



800 



1000 




FIG. 10. The limited variation in potential induces random walk behavior in the diagonal outer-product approximation, while 
the small step-sizes required for stable evolution dramatically reduce the efficiency of each transition. Moreover, dual-averaging 
adaptation of the step-size introduces the same bias seen in adaptive tuning of EMHMC. 
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2. Outer-Product 

A common approximation to the Hessian is the outer- 
product of the gradient (l9j |. 

where g = dV. Propagating this approximation through 
the SoftAbs map gives 



(g • g) 



sinh (a g • g) 



cosh (a g • g) — 1 
(g ' g) 



Note that this is equivalent to the "background-score" 
metric proposed in |l| with functions scaling the rank- 
one update as well as the entire metric. 

Evolution with this metric requires no derivatives be- 
yond the Hessian, and with a careful implementation the 
computation scales as 0(-/V 2 )H. The global coefficient, 
however, renders the metric almost singular in all but 
the simplest problems. 

Along typical trajectories the norm of the gradient is 
far from zero, and the sinh becomes highly nonlinear. 
Consequently, the numerical evolution becomes unstable 
unless a <C 1 or e < 1, at which point the evolution 
becomes impractical. For example, with a = 1 the step- 
size must be smaller than floating-point precision before 
the Metropolis accept rate rises above 0.5. 



3. Diagonal Outer-Product 

The severe non-linearities of the outer-product approx- 
imation can be avoided by considering only the diagonal 
elements of the outer-product, 



H a Diag (gf) , 

which gives the metric 

IHl ~ Diag (gf cothagf ) . 

As above, gt = diV. 

Although this approximation avoids the highly- 
nonlinear coefficient above, the typically large values of 
the gf do induce large gradients in the Hamiltonian evo- 
lution. Taking a = 1 yields stable trajectories, but the 
strong rcgularization reduces the log determinant's abil- 
ity to increase the variation in the potential (Figure [9|) 
and limits the adaptability of the metric. The limited 
adaptability makes the Markov chain vulnerable to bias 
when adapting the integrator step-size and, consequently 
the samples exhibit the same random walk behavior and 
bias towards large v seen in the EMHMC case (Figure 

m. 



APPENDIX B: A SYMPLECTIC INTEGRATOR 
FOR RMHMC 

One of the challenges of RMHMC is that the non- 
separable Hamiltonian requires a more sophisticated, and 
costly, symplectic integrator than EMHMC 0,Hl|. In 
order to construct such an integrator first rewrite the 
Hamiltonian as 



where 



and 



H = T 



r=-p T .S- 1 (q).p 



1 



log|E(q)| + V. 



Both proper scalar functions, r and </> motivate a nat- 
ural splitting of the Hamiltonian which yields the inte- 
grator 



with 



)t O ft O Tt o ft o 

2 2 2 



dj)_d_ 

dq l dpi 
dr d 
dq l dpi 
dr d 



(2) 



T 



dp 1 dqi 



Of the individual operators, only is separable and 
analytically calculable. The non-separable operators, ft o 

Tt°f±, require the implicit generalized leapfrog algorithm 
(Algorithm QJ. Efficient gradient computations complete 
the implementation (Algorithm [J). 



In fact, with automatic differentiation techniques that can com- 
pute Hessian- vector products in linear time 11911 , the computation 
can be reduced further to O(N). 
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P <~ P ~ f90/9q(p,q) 
P <- P 

while Ap > 5 do 

P' «~ P - f • <9r/dq (p, q) 
Ap = maxi {|pi - p-|} 

P^P' 
end while 

(T 4 — q 

while Aq > 5 do 

q' <- o" + f -dr/dp (p,cr) 

Ag = maxi - 

q^q' 
end while 



| ■ <9T/<9q(p,q) 
|cty/9q(p,q) 



(j/, Require: a 

Require: H 

ft o T t o ft Require: 9H 

Require: A, Q 

function dr/dp (p, q) 
return Q (Ar 1 Q 2 



dr/dp (p,q) 



Regularization parameter 
d 2 V/dq i dq j 
d 3 V/dq n dq i dq j 
Eigendecomposition of H 



function ch~/9q(p, q 
A, coth ct\i — 
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Xi Xj 



Xj cothaAj q 



Sij) 



As cothaAifc 
dXi 

Diag((Q T .p).) 
Q D J D Q T 



D 

M 

for n = 1 to N do 



5 n <- iTr[-M-9 n H] 



end for 
return S 



ALGORITHM 1. The spitting © admits a discrete integra- 
tion of Hamilton's equations, approximating the evolution of 
the current position, q, and momentum, p, for a time, At = e. 
Because they are implicit, the first two steps of the general- 
ized leapfrog must be solved with fixed-point iterations; in 
order to avoid position-dependent divergences causing bias in 
the Markov transitions, the iterations are continued until a 
given threshold instead of a set number of iterations. 



function d<f>/dq (p, q] 
Xi coth aXi — 



A.cothaA, 



+ 7TT— Xi cothaAj&j 
aXi 



R <— Diag 



1 



M «- 
for n 

S„ - 



Xi coth aXi 
Q(Ro J)Q T 
= 1 to N do 
- iTr[M-9 n H]+9 n V 



end for 
return S 



ALGORITHM 2. With careful manipulation and caching of 
intermediate matrices, the gradients necessary for Hamilto- 
nian evolution (Algorithm [1]) can be computed in 0(7V 3 ). 
Note that each trace can be computed in 0(iV 2 ) as 
Tr [A • B] = E,j AijBji. 
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